Description of the problem

In this project I analysed a dataset regarding deaths in Milan. The dataset contains, for each day between 1/01/1980 and 31/12/1989:

For each day we also have some explanatory variables:

The dataset is described in Vigotti, M.A., Rossi, G., Bisanti, L., Zanobetti, A. and Schwartz, J. (1996), Short term effect of urban air pollution on respiratory health in Milan, Italy, 1980-1989, Journal of Epidemiology and Community Health, 50, 71-75, and it has been analysed in Ruppert, D., Wand, M.P. and Carroll, R.J. (2003), Semiparametric Regression, Cambridge University Press.

The aim of the project is to build a predictive model for tot.mort and resp.mort and to understand which variables most affect the probability of death.

Exploratory Data Analysis

Univariate descriptive analysis

Response variables

The following plots show the distributions of tot.mort and resp.mort. As we can see, the values of tot.mort are large enough to model it as a continuous variable, while the values of resp.mort are much lower.

Milan population

The following plot shows the evolution of the Milan population according to the ISTAT census. The data can be found here. As we can see, the population decreased between 1980 and 1990. To account for the shrinking number of people at risk, I computed the daily population by linear interpolation between the census dates and used it as an exposure in the models. With this information I computed the mortality rate as: \[ \text{Mortality Rate} = \frac{\text{tot.mort}}{\text{population}} \] Note that this is a crude mortality rate, not an age-specific one, so it does not take into account the age distribution of the population. Given that the Milan population aged over the period 1980-1990, the crude mortality rate would have increased even if mortality at each age had remained unchanged.
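The interpolation and the rate computation described above can be sketched as follows. This is a minimal illustration, not the code used in the analysis: the census dates and counts in `CENSUS` are placeholders (they are not the actual ISTAT figures), and the function names are hypothetical.

```python
from datetime import date

# Placeholder census points (date, population). These are NOT the actual
# ISTAT figures; replace them with the real census dates and counts.
CENSUS = [
    (date(1971, 10, 24), 1_700_000),
    (date(1981, 10, 25), 1_600_000),
    (date(1991, 10, 20), 1_400_000),
]


def interpolate_population(day, census=CENSUS):
    """Linearly interpolate the population on `day` between the two
    census points that surround it."""
    for (d0, p0), (d1, p1) in zip(census, census[1:]):
        if d0 <= day <= d1:
            frac = (day - d0).days / (d1 - d0).days
            return p0 + frac * (p1 - p0)
    raise ValueError("day is outside the range covered by the census points")


def mortality_rate(tot_mort, population):
    """Crude daily mortality rate: deaths divided by population at risk."""
    return tot_mort / population


day = date(1985, 6, 1)
pop = interpolate_population(day)
rate = mortality_rate(30, pop)  # 30 deaths is an illustrative count
print(f"{day}: population ~ {pop:,.0f}, crude rate = {rate:.2e}")
```

In a count model the interpolated population would typically enter as an offset on the log scale rather than as a divisor of the response.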

Explanatory variables

The following plots show the distributions of the explanatory variables. As we can see, SO2 is strongly skewed. To deal with this, I applied a logarithmic transformation:

SO2_log = log(SO2 + 30)
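The effect of this transformation on skewness can be checked with a small sketch. The SO2 readings below are made-up illustrative values, not data from the Milan dataset, and the helper names are hypothetical. The reason for the offset of 30 is not stated in the text; presumably it keeps the argument of the logarithm positive for very small readings.

```python
import math


def so2_log(so2, offset=30.0):
    """Shifted log transform: log(SO2 + offset), natural logarithm."""
    return math.log(so2 + offset)


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, right-skewed readings for illustration only.
raw = [5, 10, 15, 20, 30, 50, 100, 400]
transformed = [so2_log(v) for v in raw]

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(transformed), 2))
```

On these values the skewness drops after the transformation but does not vanish, so it is worth re-checking the histogram of the transformed variable.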

Multivariate descriptive analysis

Seasonal effect

Mortality

The following plots show the mortality rate and the respiratory mortality rate over time. There is a strong seasonal effect, with higher mortality in winter and lower mortality in summer. We can also spot some outliers in summer 1983 and in the first quarter of 1986, when mortality was much higher than in the same period of other years.

The following plot compares the within-year mortality pattern across years. As we can see, the patterns are quite similar.

Explanatory variables

The following plots show the explanatory variables over time. As we can see, all these variables depend strongly on the time of year. This means that any strong correlation between these variables and mortality could be spurious, driven by other factors that follow the same seasonal pattern. For example, during winter:

  • people work more, so they are more stressed and workplace accidents are more frequent;
  • daylight hours are shorter, which has an impact on health;
  • there is more traffic, so car accidents are more frequent.

To deal with this problem, I included the time of year in the model through a spline on the day of the year.
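As a rough illustration of what a spline on the day of the year involves, the sketch below builds one row of a truncated-power cubic spline basis. The basis type and the knot positions are assumptions made for this example; the text does not say which spline was used, and a cyclic basis, which joins 31 December smoothly to 1 January, would be a natural alternative.

```python
from datetime import date


def day_of_year(d):
    """Day of the year, from 1 (1 January) to 365 or 366."""
    return d.timetuple().tm_yday


def cubic_spline_basis(x, knots):
    """Truncated-power cubic basis evaluated at x:
    x, x^2, x^3, then (x - k)_+^3 for each knot k."""
    row = [x, x ** 2, x ** 3]
    row.extend(max(x - k, 0.0) ** 3 for k in knots)
    return row


# Hypothetical knots, roughly at the ends of the first three quarters.
KNOTS = [91, 182, 273]

doy = day_of_year(date(1983, 7, 15))
# Rescale to [0, 1] so that the cubic terms stay numerically small.
row = cubic_spline_basis(doy / 365.0, [k / 365.0 for k in KNOTS])
print(doy, [round(v, 4) for v in row])
```

Each row would become a set of columns in the design matrix, so the seasonal component is estimated jointly with the effects of the other explanatory variables.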

Looking at the mean.temp trend, we can see that summer 1983 was hotter than other summers. That could partially explain the higher mortality in that period.

Relationships between variables

The following plot shows the pairwise relationships between the variables. As we can see, many of them are not linear, so a GLM could perform poorly and a GAM could be more suitable. In particular, the plot of mean temperature against mortality rate (mean.temp, tot_mort_prob) shows that the mortality rate is higher when it is cold and decreases as the temperature rises, but increases rapidly again at very high temperatures.
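To make the contrast concrete, the two model families can be written side by side. This is only a sketch under assumptions that the text does not confirm: a count model with a log link, the interpolated population entering as an offset, and a single temperature term. The symbols \(\beta_0\), \(\beta_1\) and the smooth function \(f\) are introduced here for illustration.

\[ \text{GLM:} \quad \log E[\text{tot.mort}_t] = \log(\text{population}_t) + \beta_0 + \beta_1 \, \text{mean.temp}_t \]

\[ \text{GAM:} \quad \log E[\text{tot.mort}_t] = \log(\text{population}_t) + \beta_0 + f(\text{mean.temp}_t) \]

A single slope \(\beta_1\) can only describe mortality that rises or falls monotonically with temperature, whereas \(f\) can fall over the cold range and rise again at very high temperatures.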

Statistical Models

Conclusions

Possible improvements